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[57] ABSTRACT 

Voice activation of functions on a network such as the 
Internet are accomplished using a speech recognition system 
running synchronously with standard desktop-based Internet 
functions. This synchronous operation allows voice-based 
control to be exercised for all operations on the Internet. 
System functions are based on a unique combination of a 
local web browser, a remotely-located speech/web server, 
and control links between a web browser and a speech/web 
server. The control links provide a mechanism for control- 
ling a speech server from a web page and a mechanism for 
driving both the local, as well as a remote, web browser. 

19 Claims, 7 Drawing Sheets 
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USING SPEECH RECOGNITION TO ACCESS at the desktop level generally. Voice macros have been 

THE INTERNET, INCLUDING ACCESS VIA A created for a number of Windows functions for use on the 

TELEPHONE Internet. A macro is a series of functions on the computer 

activated by a single command. For a voice macro, the 

TITLE OF THE INVENTION 5 speech server's recognition of an inputted voice command 

r „. „ L n activates a series of commands. 

Method and System for Using Speech Recognition to r „ . . . , ... , , 

Access the Internet, including Access Via a Telephone. fc Fwo P nor « ****** for ^peech-enablmg the Internet 

have been explored by various companies and research 

FIELD OF THE INVENTION entities. In general terms, researchers have approached the 

i° problem from either the perspective of speech-enabling the 

The present invention relates to the field of computerized Internet, or from the perspective of Internet-enabling the 

communication on the Internet. The general purpose of this telephone system. 

invention is to enable speech access to the Internet over ^ fi ret m eihod is the most common approach and the 

standard telephone lines and Internet control of telephony one bcing pursued by Texas Instruments, Apple Computer, 

functions through standard web pages. This is accomplished is and Microsoft. In this approach, the speech recognition 

through a unique combination of speech server, web cngine ^ located on the local host, along with the web 

browser, and control links. The control links provide a browser. This approach allows such activities as those 

mechanism for controlling the speech server from a web described above — voice macros for Windows functions that 

page and a mechanism for driving both the local, as well as can ^ when browsing the Internet, 

a remote, web browser. 20 TcMS Insmiments refincd lhis approach Dy using 

BACKGROUND OF THE INVENTION ? e l ™ 1 " S80ci f ,ed ^ h ? tlink f to ^ ,he ^f" 1 ^ 

for the recognizer. Apple has taken the approach of making 

The Internet is essentially a network of servers containing both the web browser and the speech recognition engine 

information that users can obtain using personal computers. scriptable (controllable with the Apple Script language). 

Users generally connect to a server, a computer equipped 25 Microsoft has taken the approach of providing tools for web 

with information and capabilities that assist the user with page developers to allow them to speech-enable their web 

contacting other servers and obtaining additional informa- pages. These tools provide a mechanism for supplying the 

tion. Users typically execute these functions, also referred to recognizer with grammars and their speech synthesizers 

as "navigating" on the Internet, using a mouse and ^ with spoken prompts. 

Windows-based software. The user's navigation of the Inter- The advantages of the present invention over this method 

net is thus essentially graphically-based (looking at a screen) include: (1) telephone access serves a far greater potential 

with functions activated using a mouse. audience than speech access limited to desktop operations; 

Speech recognition software and hardware for use in (2) no additional requirements of the user's computer, such 

conjunction with personal computers and other 35 as a speech recognition cngine, are required; (3) the system 

environments, like the Internet, is a rapidly developing uses a migration path starting with an immediate utility with 

technology. With speech recognition, a user's voice com- no long-term limitations; and (4) direct bene fits are available 

mands are recognized by a computer and then converted, from telephony integration. 

based on the speech pattern, into an electronic signal. For Internet-enabling the telephone system is primarily being 
example, speech recognition has been highly successful in 4Q investigated as a research effort. Demonstrations from MIT 
the field of long-distance telephone calling for the purpose and the Sun SpeechActs group have shown potential for 
of allowing collect calls. Typically, with this application, a using a speech-only interface for retrieving personal infor- 
caller will provide a name and a phone number to a malion (voice e-mail) over the phone and for using the 
computer when making a collect call. The computer will Internet as an up-to-date repository of information available 
then place the caller on hold and call the number to be 4S over the phone. For example, ALTech, a commercial spin-off 
reached. The person receiving the collect call will answer of MIT, has demonstrated the use of a speech server for 
"yes" or "no" in response to the computer message and the obtaining information about local movies, 
collect caller's name. The voice recognition hardware and Advantages of the present invention over this method 
software, which is also known as a speech recognition include: (1) an optional Graphical User Interface (GUI) 
engine, either signals a switch to complete the call upon 5Q makes using the system with today's World Wide Web much 
recognizing the "yes" response, or to disconnect upon rec- more pra ciical and simple than attempting to do it with 
ognizing the u no" response, speech alone; (2) the potential user base is just as large over 
One Issue with using speech recognition is selecting the the long term; and (3) providing tools to other developers is 
appropriate speech recognition engine to use for a particular expected to lead to much more rapid progress than attempt- 
application. These speech recognition engines include 55 ing to build speech-only interfaces from the ground up. 
speaker dependent and independent dictation machines, 

continuous speech systems, large vocabulary systems, and SUMMARY OF THE INVENTION 

small vocabulary systems. Further, these systems can be ITiis invention links networks such as the Internet and the 

Windows based, Macintosh based, UNIX based, Windows World Wide Web to a speech recognition server, which 

NT based, or based on another platform, depending on the 60 res ides on the telephone system, to provide for speech access 

preferred operating system. to these networks over standard telephone lines and control 

Speech recognition operating in conjunction with com- of telephony functions through standard web pages. These 

puter connection with the Internet, also known as speech capabilities are accomplished through a combination of 

enabling of the Internet, appears to have promising appli- speech server (typical of those found in Interactive Voice 

cation possibilities. One possible application of this tech- 65 Response (IVR) applications), web browser, and control 

nology is for navigational purposes on the Internet. For links. The control links consist of software that provides a 

example, speech recognition has been successfully utilized mechanism for controlling the speech server from a web 
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page, and a mechanism for driving both the local, as well as 
a remote, web browser. 

An example of the capabilities of the system is as follows. 
A user seeking a service to provide stock quotes can access 
these quotes by graphically browsing the Internet to a web 
page that continually carries the quotes. Once at the web 
page, the user can activate the present invention, telling the 
speech server to, for example, "mark this" or ''show me the 
stock quote". The server can then be set to either tell the user 
the stock price or go to that web page upon recognizing of 
the selected speech pattern. 

The general purpose of this invention is thus to provide a 
method for linking a remote speech recognition device 
operating over the telephone network to any web browser 
operating over the Internet. This link enables the user's web 
browser to be controlled by the remote speech recognition 
device, and, in turn, enables telephony functions to be 
controlled by any web browser. In addition to providing an 
immediate solution to accessing the web by voice, the 
invention provides tools and motivation for web page 
authors to generate web pages that are tailored to speech- 
only interfaces. This is expected to transform the nature of 
the web, and, over time, to support a truly multi-modal 
interface with the Internet. 

The significance of the invention is that it provides both 
a means for immediately speech-enabling the Internet and a 
means for gradually Internet-enabling the telephone system. 
Other systems have approached the problem of linking 
speech technology and the Internet from either one perspec- 
tive or the other (that is, speech-enabling the net or net- 
enabling the telephone). The approach of the present 
invention, however, can be viewed from either perspective, 
and, in so doing, leads to an immediate speech-enabling of 
the Internet, and to a process of Internet-enabling the tele- 
phone. In addition, the present invention leads to function- 
ality completely unobtainable from either of the other 
approaches taken alone. 

The control of both the server's web browser and the 
user's remote web browser also enables an optional GUI for 
the user of the Speech/web server. The GUI link is not 
required for the system to operate; however, because the web 
is currently graphically-oriented, the ability to use the local 
web browser as a GUI for the speech-driven browser is 
expected to be beneficial when surfing the web by voice. The 
concept of a telephony-based web browser with an optional 
GUI constitutes a significant attribute of the system because 
it provides a common platform that can be used for simple 
applications by anyone with a telephone. In addition, it can 
be used for more difficult tasks when a PC or workstation is 
available to the user. 

Anoiher example of the use of the present indention 
pertains to speech input and output over telephone lines as 
(he additional modality that can be linked to the conven- 
tional web browser interface. Thus, rather than placing a 
call, hanging up, and placing another call, a user will be able 
to browse using the telephone. This browsing includes such 
activities as seamlessly speaking to one person, and then 
connecting to another, and then checking messages and 
ordering a pizza, all without hanging up and without ever 
dialing a number. The same method links any alternative 
user interface to the user's standard web browser. This 
pertains to browsers with teletypewriter (TTY) interfaces, 
browsers that understand and speak other languages, or even 
browsers capable of providing a sense of smell, sight, taste, 
and touch. 

Additional objects, ad van I ages and novel features of the 
invention will be set forth in part in the description which 
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follows, and in part will become more apparent to those 
skilled in the art upon examination of the following or may 
be learned by practice of the invention. The objects and 
advantages of the invention may be realized and attained by 
5 means of the instrumentalities and combinations particularly 
pointed out in the appended claims. 

To achieve the stated and other objects of the present 
invention, as embodied and described below, the invention 
may comprise the steps of: 
10 accessing a voice recognition server through a voice 
transmission device; 
such device translating voice transmissions into electronic 
signals; and 

15 using said translated voice transmissions to perform func- 
tions on the Internet via voice translation being per- 
formed by said server. 

BRIEF DESCRIPTION OF THE DRAWINGS 

20 A block diagram of the invention is shown in FIG. 1. 
FIG. 2 shows how a user that happens across the web page 

containing connection information on the present invention 

initiates the process of speech enabling his or her web 

browser using the preferred embodiment. 
25 FIG. 3 illustrates the exchange of information necessary 

to speech enable a web browser. 

FIG. 4 shows the connections in place for operation of the 

preferred embodiment. 
30 FIG. 5 illustrates all of the components of the system in 

operation. 

FIG. 6 Contains an alternative embodiment, in which the 
local web browser is a slave to the speech/web server. 

FIG. 7 contains a second alternative embodiment, in 
35 which the speech/web server is a slave to the local web 
browser. 

DETAILED DESCRIPTION OF THE 
PREFERRED EMBODIMENT 

40 

Using the drawings, the preferred embodiment of the 
present invention will now be explained. 

A block diagram of the invention is shown in FIG. 1. A 
local web browser 1, such as Netscape on a PC, is used to 

45 browse the Internet 2 using a conventional Transport Control 
Protocol (TCP) link 3. The local web browser 1 contains an 
Applied Speech Technology Protocol (ASTP) plugin 4, 
which communicates by ASTP link 5 with an ASTP con- 
troller 6 located within a specch/web browser 7 of a speech/ 

50 web server 8, such as a Pentium processor-based PC running 
Windows NT. This PC also hosts, or a separate PC coupled 
to the speech/web server 8 hosts, the speech server 9, which 
is coupled 10 to the ASTP controller 6. These couples can 
consist of such connections as an electronic circuit, a fiber 

5S optic line, an electromagnetic signal, or any other means of 
coupling known in the art. A Dialogic line card located in the 
backplane of the speech server 9 PC couples 11 the speech 
server to a telephone network 12. The speech/web browser 
7 is also TCP linked 13 to the Internet 2. 

60 The three major components of the speech/web server 
thus are the specch/web browser 7, the speech server 9 with 
telephony functions, and I lie ASTP controller software 4 and 
6. 

The speech/web browser 7 is a standard, off-the-shelf web 
65 browser with an ASTP plug-in 6. The ASTP plug-in 6, as 
described below, is a software program written in a 
language, such as JAVA, that allows the program to run 
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within a web browser, such as Netscape. However, since the 
speech/web browser 7 is driven by speech-only, it is always 
run in text-only mode. This gives it a considerable response 
time advantage over a browser that must download and 
display graphics. The time normally devoted to graphics can 
thus be used by the recognizer (speech server 4) lo compile 
the grammar for the new web page. 

The speech server 9 is typical of those used for IVR and 
operator assist applications. These systems vary consider- 
ably in the number of simultaneous channels of speech 
recognition they can support, but arc most often built from 
off-the-shelf components that plug into a PC (AT bus). A 
typical configuration for a speech server would be a Pentium 
class PC running UNIX or Windows NT, loaded with a 
speech recognizer such as ALTech, PureSpeech, or Nuance, 
with a Dialogic line card capable of handling multiple 
simultaneous telephone lines, and two speech recognition 
boards, each with four channels of recognition. Speech 
output is either from pre-recorded prompts or a speech 
synthesizer. The telephone line card enables the system to 
dial out, receive calls, and to conference calls. 

The ASTP software 4 and 6 is the heart of the system. As 
noted, this software is written and distributed as a plug-in 
module to Netscape or other browsers and is written in a 
typical software that can operate in Netscape, such as JAVA. 
The protocol is a superset of the Common Client Interface 
(CCI), which provides the mechanism for establishing a 
persistent link between the speechAveb browser 7 and the 
user's browser (local web browser 1). The persistent link 
enables the speech/web browser 7 to remotely control the 
user's web browser 1, the user's web browser 1 to control 
the speech/web browser 7, and also allows the two browsers 
1 and 7 to traverse the web in tandem. 

In addition to the CCI-likc capability, the ASTP protocols 
provide the interface to the speech server 9, telling the 
recognizer what grammar to compile for the next web page. 
This function is typically fulfilled by simply stripping the 
text associated with each hotlink and sending it lo the 
recognizer's grammar compiler. Alternatively, versions of 
the protocol support calls to high-level routines, called 
"speech behaviors", that handle all of the dialog between the 
user and the machine. These high-level routines allow users 
to supply, by voice, specific kinds of information when using 
the Internet, such as credit card numbers, addresses, and 
telephone numbers. By providing web page authors with 
access to well-designed dialog modules that can be easily 
deployed through simple-to-use web authoring tools, such as 
the ASTP protocols, the predominately graphical nature of 
the web changes to accommodate a speech-only, telephone- 
based interface. 

Finally, the ASTP link 5 is what provides the conduit 
between the web page and the telephone. This allows web 
authors to include telephone numbers associated with hot- 
links that can be dialed by the speech/web server 8. This 
capability may change how switching is currently done in 
the telephone network 12. 

FIG. 2 shows how a user that happens across the web page 
containing connection information on the present invention 
initiates the process of speech enabling his or her web 
browser using trie preferred embodiment. A user 15 using a 
local web browser 16 initiates a TCP connection 17 with the 
specchAvcb site, which is served by the specch/web server 
18, by selecting a hotlink such as "surf the web by voice" at 
the web site. 

In FIG. 3, user 15 of a local web browser 16 and local 
telephone 19 uploads 20 to the spcechAvcb server 18 from 
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the local web browser 16 the local telephone number and 
downloads 21 the ASTP plug-in from the speech/web server 
18. In FIG. 4, the user 15 of the local web browser 16 and 
local telephone 19 simultaneously connects by ASTP con- 
nection 17 and by telephone connection 22 with the speech/ 
web server 18. 

The setup of the preferred embodiment is now completed, 
as shown in FIG. 5. The user 15 of the local web browser 16 
and local telephone 19 simultaneously communicates with 
the speech/web server 18 via ASTP connection 17 and 
telephone connection 22. The user 15 is also connected by 
a TCP link 25 to other web servers 24 simultaneously 26 
with the speech/web server 18 connection by a TCP link 23 
with those other web servers 24. 

As a result of these simultaneous links 26, the user can 
browse the Internet using voice while looking at the screen 
of the local web browser 16 and speaking over the phone 19. 
20 Typically these links allow a user to speak into the phone 
using words within the system's capability. These words are 
recognized and interpreted by the speech/web browser 
located at the speech/web server 18 and translated into a 
TCP link 23 command for the speech/web browser at the 
spcechAvcb server 18. At the same time, the ASTP supplies 
the same TCP link command 17 on the local web browser 
16. Thus, the user 15 speaks to control browsing of (he 
Internet. 

30 A significant advantage of the preferred embodiment is 
responsiveness. The dual link approach allows time for the 
speechAveb server to generate grammars while the user's 
browser is busy displaying graphics. A secondary advantage 
is that neither of the web browsers need to be modified for 

3S the system to work. 

Variation and Modifications 

Two variations on the invention are illustrated in FIGS. 6 
and 7. These approaches differ from the one described in 

40 FIG. 1 in that they require only a single link into the Internet, 
rather than the two links described previously. 

In the method shown in FIG. 6, the local web browser I 
with ASTP phig-in 4 is linked 5 to an ASTP controller 6 

4S located within a speech/web browser 7 housed within a 
Pentium processor PC-based speechAveb server 8. This PC 
is typically running Windows. This PC also hosts, or a 
separate PC coupled to the speechAveb server 8 hosts, the 
speech server 9, which is coupled 10 to the ASTP controller 

50 6. The speech server 9 is linked It to a telephone network 
12. The speechAveb browser 7 is also TCP linked 13 to the 
Internet 2. 

The primary difference between this alternative and the 
earlier embodiment (FIG. 1) is that a direct link 13 does not 

55 exist between the speechAveb browser 6 and the Internet 2 
simultaneous with a link between the local web browser 1 
and the Internet 2 (link 3 of FIG. 1). 
In the method shown in FIG. 7, the local web browser 1 

OT with ASTP phig-in 4 is linked 5 to an ASTP controller 6 
located within a speechAveb browser 7 housed within a 
Pentium processor-based PC speechAveb server 8. This PC 
also hosts, or a separate PC coupled to the speech/web server 
8 hosts, the speech server 9, which is coupled 10 to the ASTP 

6$ controller 6. The speech server 9 is linked 11 to a telephone 
network 12. The local web browser 1 is also TCP linked 3 
to the Internet 2. 
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The primary difference between ihis alternative and the 
earlier embodiment (FIG. 1) is that a direct link does not 
exist between the speech/web browser 6 and the Internet 2 
(link 13 of FIG. 1) simultaneous with a link 3 between the 
local web browser I and the Internet 2. 

What is claimed is: 

1. A remote server to enable a local user to increase the 
functionality of a local browser having a graphical user 
interface, comprising: 

a remote web browser residing on the remote server; 

a speech controller electronically coupled to said remote 
web browser, said controller being configured to form 
control links coupling the local browser to said remote 
browser via an Internet data communication link to 
enable said remote web browser and the local browser 
to function cooperatively; and 

a speech server having a speech recognition function 
residing on the remote server, said speech server cou- 
pling said controller to a telephone network so that a 
telephonic voice communication link may be estab- 
lished between the user and said controller; 

wherein voice commands to control browsing may be 
input via said telephonic voice communication link and 
wherein graphical user interface commands to control 
browsing may also be input via the local browser. 

2. The server of claim 1, wherein said controller and said 
server are configured to form said telephonic voice commu- 
nication link in response to the user accessing a web site via 
said Internet data communication link. 

3. The server of claim 1 wherein said control links are 
configured to enable the local browser to control the tele- 
phonic function of said speech server. 

4. The server of claim 1, wherein said controller is a 
software module contained in said remote browser. 

5. The server of claim 4, wherein said controller is 
configured to download a software program to the local 
browser to form persistent control links. 

6. A remote server to enable a local user to increase the 
functionality of a local browser, comprising: 

a remote web browser residing on the remote server, 

a speech controller electronically coupled to said remote 
web browser, said controller being configured to form 
control links coupling the local browser to said remote 
browser via an Internet data communication link to 
enable said remote web browser and the local browser 
to function cooperatively; and 

a speech server having a speech recognition function 
residing on the remote server, said speech server cou- 
pling said controller to a telephone network so that a 
voice communication link may be established between 
the user and said controller; 

wherein said control links arc configured to enable voice 
commands to be uploaded to control the browsing 
function while information from the Internet is down- 
loaded to a graphical user interface of the local browser. 

7. The server of claim 6, wherein said control links are 
configured so that the user may browse by both voice 
commands and by inputting commands via said graphical 
user interface. 

8. A network system, comprising: 

a) a local browser disposed on a local computer; and 

b) a remote server including: 

i) a remote browser residing on said remote server, 

ii) a speech controller software module electronically 
coupled to said remote browser, and 



iii) a speech server having a speech recognition func- 
tion residing on the remote server, said speech server 
coupling said speech controller software module to a 
telephone network so that a voice communication 
5 link may be established between the user and said 

speech controller software module; 
said controller software module having an interface pro- 
tocol for remotely controlling web browsers configured 
to form control links coupling said local browser to said 
10 remote browser via a network data link to enable said 
remote web browser and said local browser to function 
cooperatively, wherein said control links are configured 
so that auxiliary voice commands may be input by the 
user to control browsing of the network, 
is 9. The system of claim 8, wherein said local browser 
includes a graphical user interface and said control links are 
configured so that the user may browse by both voice 
commands and by inputting commands via said graphical 
user interface. 
20 10. A network system, comprising: 

a) a local browser disposed on a local computer; and 

b) a remote server including: 

i) a remote browser residing on said remote server; 

ii) a speech controller software module electronically 
25 coupled to said remote browser, said controller soft- 
ware module being configured to form control links 
coupling said local browser to said remote browser 
via a network data link to enable said remote web 
browser and said local browser to function coopera- 
tively; and 

iii) a speech server having a speech recognition func- 
tion residing on the remote server, said speech server 
coupling said speech controller software module to a 
telephone network so that a voice communication 
link may be established between the user and said 
speech controller software module; 

wherein said control links are configured to enable voice 
commands to be uploaded to control the browsing 
function while information from the network is down- 
loaded to the graphical user interface of said local 
browser. 

11. The system of claim 10 wherein said controller 
software module includes an interface protocol for remotely 
controlling a web browser. 

12. A method for permitting a local user to link a local 
web browser to a remote speech recognition device, com- 
prising the steps of: 

a) electronically coupling the local browser to a web-site 
served by a remote server; 

b) downloading a software program from a remote web 
browser residing on said remote server to form control 
links between the local web browser and a controller 
coupled to said remote web browser; and 

c) telephoning the user to form a voice communication 
link between the user and said controller via a speech 
server coupling said controller to a telephone network; 

whereby the user may input voice commands which are 
translated by said speech server to control browsing of 
60 a computer network while information from the net- 
work is downloaded to a graphical user interface of the 
local browser. 

13. The method of claim 12, further comprising after step 
"b" the step of: uploading the phone number of the local 

65 user. 

14. The method of claim 12, wherein said controller 
software module is contained in said remote web browser. 
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15. A method for permitting a local user to use voice 
commands to perform functions on a network, comprising 
the steps of: 

a) providing a remote server, the remote server having a 
controller for forming a first data communication link 
with a local user and a speech server for converting 
voice commands into control signals; 

b) accessing said remote server to form a first electronic 
communication link to a local browser, 

c) telephoning the user to form a voice transmission 
communication link coupling the user to the controller 
via said speech server; 

d) translating voice commands into electronic data signals 
using said speech server; and 

e) using said translated voice commands to perform 
functions on the network; 

wherein said controller is configured to enable voice 
commands to be uploaded to control the browsing 
function while information from the network is down- 
loaded to a graphical user interface of the local browser. 



16. The method of claim 15, wherein the network com- 
prises the Internet and further wherein the controller is 
contained in a remote browser residing on said remote 
server. 

5 17. The method of claim 16, wherein said speech server 
is coupled to a telephone network and further comprising 
after step "b" the step of: 
uploading a local telephone number. 

18. The method of claim 15, further comprising the step 

io of: 

downloading a software program to said local browser to 
enable a persistent link to be formed between the 
controller and the local browser. 

19. The method of claim 15, wherein said speech server 
15 is coupled to a telephone network and further comprising the 

step of: 

accessing a hot-linked phone number on a web-site to 
initiate dialing of said phone number by said speech 
server. 

20 



Page 13 (KYao, 11/17/2000, EAST Version: 1.01.0015) 



